{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "OpenRec_Basics_Diversity_and_Fairness.ipynb",
      "version": "0.3.2",
      "provenance": [],
      "private_outputs": true,
      "collapsed_sections": [],
      "toc_visible": true,
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "[View in Colaboratory](https://colab.research.google.com/github/ylongqi/openrec/blob/master/tutorials/OpenRec_Basics_Diversity_and_Fairness.ipynb)"
      ]
    },
    {
      "metadata": {
        "id": "1Ql0oQmCi_5K",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "<p align=\"center\">\n",
        "  <img src =\"https://recsys.acm.org/wp-content/uploads/2017/07/recsys-18-small.png\" height=\"40\" /> <font size=\"4\">Recsys 2018 Tutorial</font>\n",
        "</p>\n",
        "<p align=\"center\">\n",
        "  <font size=\"4\"><b>Modularizing Deep Neural Network-Inspired Recommendation Algorithms</b></font>\n",
        "</p>\n",
        "<p align=\"center\">\n",
        "  <font size=\"4\">Hands on: Intro to OpenRec + diversity and fairness</font>\n",
        "</p>"
      ]
    },
    {
      "metadata": {
        "id": "MJOeBgu7w1Qy",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "# Part1. Get started with the framework and look into per-group accuracy of recommendations"
      ]
    },
    {
      "metadata": {
        "id": "vIVn1P4lnxbb",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "Recommender systems contain multiple pieces that needed to be implemented: data processing, model training, inference. OpenRec allows to use modularized approach to implement different pieces separately and reuse them for different recommenders. \n",
        "\n",
        "## here is the sample pipeline"
      ]
    },
    {
      "metadata": {
        "id": "-7377Dd3oPa0",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "<p align=\"center\">\n",
        " <img src =\"https://s3.amazonaws.com/cornell-tech-sdl-openrec/tutorials/pipeline.png\" width=\"500\" height=“20” />\n",
        "</p>\n"
      ]
    },
    {
      "metadata": {
        "id": "cRnurYm91MrY",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## data\n",
        "\n",
        "For this tutorial we use Last.fm dataset with 1K users history. This is a toy problem and you can explore larger dataset examples in our repo (Amazon, CiteULike, Tradesy).\n",
        "\n",
        "\n",
        "![alt text](https://s3.amazonaws.com/cornell-tech-sdl-openrec/lastfm/fairness/last.png =100x50)\n",
        "\n",
        "\n",
        "* 1000 users\n",
        "* 15000 items\n",
        "* 570K entries\n",
        "\n",
        "## task\n",
        "\n",
        "1. Recommend artists that a user will like.\n",
        "2. Evaluate the performance on different user groups\n"
      ]
    },
    {
      "metadata": {
        "id": "REwn-8xRw5pv",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "!pip install tensorflow\n",
        "!pip install matplotlib\n",
        "\n",
        "!pip install openrec\n",
        "\n",
        "import urllib.request\n",
        "\n",
        "dataset_prefix = 'http://s3.amazonaws.com/cornell-tech-sdl-openrec'\n",
        "urllib.request.urlretrieve('%s/lastfm/fairness/lastfm1k.npy' % dataset_prefix, \n",
        "                   'lastfm1k.npy')\n",
        "urllib.request.urlretrieve('%s/lastfm/fairness/lastfm_artists.json' % dataset_prefix, \n",
        "                   'lastfm_artists.json')\n",
        "urllib.request.urlretrieve('%s/lastfm/fairness/lastfm_users.json' % dataset_prefix, \n",
        "                   'lastfm_users.json')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "8yGop_34BVYv",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "import tensorflow as tf\n",
        "import random\n",
        "\n",
        "random.seed(1)\n",
        "tf.set_random_seed(1)\n",
        "import numpy as np\n",
        "import json\n",
        "import matplotlib.pyplot as plt\n",
        "from collections import defaultdict\n",
        "from tqdm import tqdm_notebook as tqdm"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "bTno1U6pgc1C",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## load data and split into train and test set"
      ]
    },
    {
      "metadata": {
        "id": "KQuHXkt3Ce7p",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "# load data\n",
        "\n",
        "raw_data = dict()\n",
        "raw_data['max_user'] = 992\n",
        "raw_data['max_item'] = 14598\n",
        "data = np.load('./lastfm1k.npy')\n",
        "range_ids = list(range(data.shape[0]))\n",
        "random.shuffle(range_ids)\n",
        "\n",
        "print(\"data sample:\\t\", data[0])\n",
        "print(\"data type:\\t\", data[0].dtype)\n",
        "print(\"data shape:\\t\", data.shape)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "3HBaKQn-gux1",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "# split data 90:10\n",
        "raw_data['train_data'] = data[range_ids[:int(data.shape[0]*0.9)]]\n",
        "raw_data['test_data'] = data[range_ids[int(data.shape[0]*0.9):]]"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "3o6BMqjCgwRH",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## create train dataset"
      ]
    },
    {
      "metadata": {
        "id": "XDMJY5RMslu6",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "from openrec.tf1.utils import Dataset\n",
        "\n",
        "\n",
        "train_dataset = Dataset(raw_data['train_data'], raw_data['max_user'], \n",
        "                        raw_data['max_item'], name='Train')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "sc1TkLLnDquH",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## load users data and calculate gender balance"
      ]
    },
    {
      "metadata": {
        "id": "_PvnKMklsuDW",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "with open('./lastfm_users.json','r') as f:\n",
        "    users = json.load(f)\n",
        "gender_balance= defaultdict(int)\n",
        "for user in users['metadata']:\n",
        "    gender_balance[user['gender']] += 1\n",
        "gender_balance\n",
        "\n",
        "gender_dict= defaultdict(list)\n",
        "for pos, user in enumerate(users['metadata']):\n",
        "    gender_dict[(user['gender'])].append(pos)\n",
        "    \n",
        "## convert back to dict() to prevent accidental modifications\n",
        "gender_dict = dict(gender_dict)\n",
        "    \n",
        "# extract nan value\n",
        "nan_value = list(gender_dict.keys())[2]\n",
        "\n",
        "balance = [(x, len(y)/raw_data['max_user']) for x, y in gender_dict.items()]\n",
        "\n",
        "plt.title('Users gender balance')\n",
        "graph = plt.pie([x[1] for x in balance], labels=[x[0] for x in balance], autopct='%1.1f%%',\n",
        "        shadow=True, startangle=90, textprops={'fontsize': 18})\n",
        "\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "oyTWiTzOhNa8",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## define hyperparameters"
      ]
    },
    {
      "metadata": {
        "id": "L5cvsFM7xV-2",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "dim_embed = 10\n",
        "total_iter = int(5e3)\n",
        "eval_iter = 5000\n",
        "batch_size = 1000\n",
        "save_iter = eval_iter"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "7V4Umg59hTZY",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## create sampler"
      ]
    },
    {
      "metadata": {
        "id": "ZmxXd88-_1U7",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "from openrec.tf1.utils.samplers import RandomPointwiseSampler\n",
        "# https://github.com/ylongqi/openrec/blob/master/openrec/utils/samplers/random_pointwise_sampler.py \n",
        "\n",
        "\n",
        "train_sampler_pointwise = RandomPointwiseSampler(batch_size=batch_size, dataset=train_dataset, num_process=5)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "nVG3j1syxhhc",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## create evaluators and samplers per gender for users"
      ]
    },
    {
      "metadata": {
        "id": "B6H48VFUxSRz",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "from openrec.tf1.utils.samplers import EvaluationSampler\n",
        "# https://github.com/ylongqi/openrec/blob/master/openrec/utils/samplers/evaluation_sampler.py\n",
        "\n",
        "\n",
        "evaluator_sampler_dict = dict()\n",
        "for key, value in gender_dict.items():\n",
        "  # filter users by gender\n",
        "  data = np.argwhere(np.in1d(raw_data['test_data']['user_id'], \n",
        "                             value) == True)[:, 0]\n",
        "\n",
        "  # new dataset for test data\n",
        "  ds = Dataset(raw_data['test_data'][data], raw_data['max_user'],\n",
        "               raw_data['max_item'], name=str(key), num_negatives=500)\n",
        "  evaluator_sampler_dict[key] = EvaluationSampler(batch_size=batch_size, dataset=ds)\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "uDrlJrdIABdn",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "from openrec.tf1.utils.evaluators import AUC, Precision, Recall\n",
        "# https://github.com/ylongqi/openrec/tree/master/openrec/utils/evaluators\n",
        "\n",
        "prec_evaluator = Precision(precision_at=[5])\n",
        "rec_evaluator = Recall(recall_at=[5])\n",
        "auc_evaluator = AUC()\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "vsyaCKgEhqVM",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## create model and model trainer"
      ]
    },
    {
      "metadata": {
        "id": "FsbvQ8D0G1rV",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "from openrec.tf1.recommenders import PMF\n",
        "# https://github.com/ylongqi/openrec/blob/master/openrec/recommenders/pmf.py\n",
        "\n",
        "pmf_model = PMF(batch_size=batch_size, total_users=train_dataset.total_users(), \n",
        "                total_items=train_dataset.total_items(), \n",
        "                dim_user_embed=dim_embed, dim_item_embed=dim_embed, \n",
        "                save_model_dir='pmf_recommender/', train=True, serve=True)\n",
        "\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "eTJlvq1Wlhpa",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "from openrec import ModelTrainer\n",
        "# https://github.com/ylongqi/openrec/blob/master/openrec/model_trainer.py\n",
        "\n",
        "\n",
        "model_trainer_pmf = ModelTrainer(model=pmf_model)\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "fACNVMTJh4jk",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## train simple model"
      ]
    },
    {
      "metadata": {
        "id": "Qs4VAvFFz-DF",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "model_trainer_pmf.train(total_iter=total_iter, \n",
        "                    eval_iter=eval_iter, \n",
        "                    save_iter=save_iter, \n",
        "                    train_sampler=train_sampler_pointwise, \n",
        "                    eval_samplers=evaluator_sampler_dict.values(), \n",
        "                    evaluators=[auc_evaluator, prec_evaluator, rec_evaluator])\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "MV111zBnzLPS",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## TODO. Train new model using BPR\n",
        "\n",
        "* Try to use another model (BPR)\n",
        "* Use same model parameters\n",
        "* Use same Samplers and Evaluators\n",
        "* Use different sampler (RandomPairwiseSampler)\n",
        "\n",
        "\n",
        "\n"
      ]
    },
    {
      "metadata": {
        "id": "_Ja9ZG-u0ROk",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "from openrec.tf1.recommenders import BPR\n",
        "from openrec.tf1.utils.samplers import RandomPairwiseSampler"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "GPYha8pI0Rjp",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "### 1. Create new sampler RandomPairwiseSampler\n",
        "### 2. Create new model BPR\n",
        "### 3. Create new model trainer with BPR model\n",
        "### 4. Train model with new sampler RandomPointwiseSampler\n",
        "### you can reuse same parameters as in previous example"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "Rk-imFA6e4ra",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "#ANSWER \n",
        "train_sampler_pairwise = RandomPairwiseSampler(batch_size=batch_size, dataset=train_dataset, num_process=5)\n",
        "\n",
        "bpr_model = BPR(batch_size=batch_size, total_users=train_dataset.total_users(), \n",
        "                total_items=train_dataset.total_items(), \n",
        "                dim_user_embed=dim_embed, dim_item_embed=dim_embed, \n",
        "                save_model_dir='bpr_recommender/', train=True, serve=True)\n",
        "\n",
        "\n",
        "model_trainer_bpr = ModelTrainer(model=bpr_model)\n",
        "\n",
        "model_trainer_bpr.train(total_iter=total_iter, \n",
        "                    eval_iter=eval_iter, \n",
        "                    save_iter=save_iter, \n",
        "                    train_sampler=train_sampler_pairwise, \n",
        "                    eval_samplers=evaluator_sampler_dict.values(), \n",
        "                    evaluators=[auc_evaluator, prec_evaluator, rec_evaluator])\n",
        "\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "5iszCiE_0yJi",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "# Part 2. Fairness. Balanced user sampling\n",
        "\n",
        "Some group of users have lower presence in dataset and therefore lower performance. \n",
        "\n",
        "In this section, we want to show that OpenRec allows researchers to experiment with sampling data during training."
      ]
    },
    {
      "metadata": {
        "id": "s7Uq58L8poWN",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "## let's see again the gender balance for users\n",
        "graph = plt.pie([x[1] for x in balance], explode=[0,0,.4], labels=[x[0] for x in balance], autopct='%1.1f%%',\n",
        "        shadow=True, startangle=90, textprops={'fontsize': 18})\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "DAjvxKHPqXWn",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "Let's look at \"nan\" users who are not well represented in dataset and try to write a function that oversamples\n",
        "them for training"
      ]
    },
    {
      "metadata": {
        "id": "FSJhLiXTtlK7",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "nan_users_list = gender_dict[nan_value]\n",
        "print('Number of users: ')\n",
        "len(nan_users_list)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "mY99G8ObuQNB",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## TODO. Rebalance NAN group to make the performance on this group similar to other groups\n",
        "\n",
        "\n",
        "* Modify Pointwise sampler to assign larger labels when users belong to NAN category\n",
        "* Measure effect of your changes\n",
        "* You can modify method parameters to make it more general\n",
        "* What are better ways to experiment with fairness?"
      ]
    },
    {
      "metadata": {
        "id": "uso5xejkpoO0",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "from openrec.tf1.utils.samplers import Sampler\n",
        "\n",
        "def RandomPointwiseSamplerFair(dataset, batch_size, num_process=5, seed=100):\n",
        "    \n",
        "    random.seed(seed)\n",
        "    def batch(dataset, batch_size=batch_size):\n",
        "        \n",
        "        while True:\n",
        "            input_npy = np.zeros(batch_size, dtype=[('user_id', np.int32),\n",
        "                                                        ('item_id', np.int32),\n",
        "                                                        ('label', np.float32)])\n",
        "            \n",
        "            for ind in range(batch_size):\n",
        "                user_id = random.randint(0, dataset.total_users()-1)\n",
        "                item_id = random.randint(0, dataset.total_items()-1)\n",
        "                label = 1.0 if dataset.is_positive(user_id, item_id) else 0.0\n",
        "                ##### START YOUR CODE HERE\n",
        "                \n",
        "                # ASSIGN LARGER WEIGHTS TO ENTRIES BY NAN USERS\n",
        "                \n",
        "                #ANSWER\n",
        "                if user_id in nan_users_list:\n",
        "                  label *= 1.8\n",
        "                \n",
        "                ##### END YOUR CODE HERE\n",
        "                \n",
        "                input_npy[ind] = (user_id, item_id, label)\n",
        "            yield input_npy\n",
        "    \n",
        "    s = Sampler(dataset=dataset, generate_batch=batch, num_process=num_process)\n",
        "    \n",
        "    return s"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "PhRVfGKvpoHN",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "# create new sampler\n",
        "\n",
        "train_sampler_pointwise_fair = RandomPointwiseSamplerFair(batch_size=batch_size, dataset=train_dataset, num_process=5)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "34-RtOBXpn-P",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "# use sampler to train your model\n",
        "\n",
        "pmf_model_fair = PMF(batch_size=batch_size, total_users=train_dataset.total_users(), \n",
        "                total_items=train_dataset.total_items(), \n",
        "                dim_user_embed=dim_embed, dim_item_embed=dim_embed, \n",
        "                save_model_dir='pmf_recommender/', train=True, serve=True)\n",
        "\n",
        "model_trainer_pmf_fair = ModelTrainer(model=pmf_model_fair)\n",
        "\n",
        "\n",
        "model_trainer_pmf_fair.train(total_iter=total_iter, \n",
        "                    eval_iter=eval_iter, \n",
        "                    save_iter=save_iter, \n",
        "                    train_sampler=train_sampler_pointwise_fair, \n",
        "                    eval_samplers=evaluator_sampler_dict.values(), \n",
        "                    evaluators=[auc_evaluator, prec_evaluator, rec_evaluator])\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "OFI2pbJ7zAM6",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "# Part 3. Add some diversity to your items\n",
        "\n",
        "In this section, we will try to add more diversity to recommendations. We will use very simple design that boosts predictions for rare items in postprocessing step.\n",
        "\n",
        "Essentially, when the model is served it takes input: ```(user_id, item_id)```. We can find not popular items and set them "
      ]
    },
    {
      "metadata": {
        "id": "zwcpYox11JZd",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "# calculate popularity\n",
        "\n",
        "item_ids, item_counts = np.unique(raw_data['train_data']['item_id'], return_counts=True)\n",
        "\n",
        "\n",
        "mean_count = np.mean(item_counts)\n",
        "\n",
        "plt.hist(item_counts)\n",
        "print(\"Mean value: \\t\", mean_count)\n",
        "plt.title('item popularity hist.')\n",
        "plt.xlabel('Number of times items are consumed')\n",
        "plt.ylabel('Number of items')\n",
        "plt.yscale('log')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "gjyw8KcK8gPa",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## let's find tail items"
      ]
    },
    {
      "metadata": {
        "id": "l1-Qf9zT6Hlo",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "print('Total number of items:\\t ', item_ids.shape)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "m3C_dT3o1iZ4",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "## create list of rare items\n",
        "rare_items = list()\n",
        "for pos, item_id in enumerate(item_ids):\n",
        "  if item_counts[pos] < 10:\n",
        "    rare_items.append(item_id)\n",
        "rare_items = np.array(rare_items)\n",
        "print('Number of rare items (less than 10 occurences in train data) :', rare_items.shape)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "9htUSSnM-iRF",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## calculate predictions for rare and popular items"
      ]
    },
    {
      "metadata": {
        "id": "jxpU8ahKzQbt",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "# ### Default evaluation function:\n",
        "#\n",
        "# def _default_eval_iter_func(self, model, batch_data):\n",
        "#     return np.squeeze(model.serve(batch_data)['outputs'])\n",
        "#\n",
        "# ###\n",
        "\n",
        "\n",
        "# get a sample batch from data\n",
        "pos_items, batch_data  = evaluator_sampler_dict[nan_value].next_batch()\n",
        "\n",
        "# get entries for rare and popular items\n",
        "rare_items_in_batch = np.argwhere(np.in1d(batch_data['item_id'], rare_items) == True)\n",
        "pop_items_in_batch = np.argwhere(np.in1d(batch_data['item_id'], rare_items) == False)\n",
        "\n",
        "\n",
        "# calculate predictions for given batch\n",
        "res = model_trainer_pmf_fair._default_eval_iter_func(pmf_model, batch_data)\n",
        "\n",
        "print(f'Rare items mean prediction: {np.mean(res[rare_items_in_batch])}')\n",
        "print(f'Popular items mean prediction: {np.mean(res[pop_items_in_batch])}')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "B9SWX5f7-2gP",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## define function that boosts predictions for rare items"
      ]
    },
    {
      "metadata": {
        "id": "4erXXLW26ZJn",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "scale_parameter = 2\n",
        "\n",
        "\n",
        "def diverse_eval_iter_func(model, batch_data):\n",
        "    rare_items_in_batch = np.argwhere(np.in1d(batch_data['item_id'], rare_items) == True)\n",
        "    predictions = np.squeeze(model.serve(batch_data)['outputs'])\n",
        "    predictions[rare_items_in_batch] *= scale_parameter\n",
        "    return predictions"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "Y9tKiO0f4Tbi",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "res_diverse = diverse_eval_iter_func(model_trainer_pmf_fair, batch_data)\n",
        "print(f'Rare items mean score: {np.mean(res_diverse[rare_items_in_batch])}')\n",
        "print(f'Popular items mean score: {np.mean(res_diverse[pop_items_in_batch])}')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "HfpuWNw4zQ3i",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "# train a model that has more diverse predictions\n",
        "\n",
        "pmf_model_fair_diverse = PMF(batch_size=batch_size, total_users=train_dataset.total_users(), \n",
        "                total_items=train_dataset.total_items(), \n",
        "                dim_user_embed=dim_embed, dim_item_embed=dim_embed, \n",
        "                save_model_dir='pmf_recommender/', train=True, serve=True)\n",
        "\n",
        "model_trainer_pmf_fair_diverse = ModelTrainer(model=pmf_model_fair_diverse, eval_iter_func=diverse_eval_iter_func)\n",
        "\n",
        "\n",
        "model_trainer_pmf_fair_diverse.train(total_iter=total_iter, \n",
        "                    eval_iter=eval_iter, \n",
        "                    save_iter=save_iter, \n",
        "                    train_sampler=train_sampler_pointwise_fair, \n",
        "                    eval_samplers=evaluator_sampler_dict.values(), \n",
        "                    evaluators=[auc_evaluator, prec_evaluator, rec_evaluator])\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "CBDfaa-q_mi6",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "# Congrats! \n",
        "\n",
        "You have trained a simple model and implemented your own sampler that balances users' entries and oversamples gender NAN and added very simple diversity to your recommendations.\n",
        "\n",
        "* Use it for your problems and let us know about new issues or ideas\n",
        "* Help us by contributing to the project: https://github.com/ylongqi/openrec\n"
      ]
    },
    {
      "metadata": {
        "id": "5WKQXT37njZo",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "<!-- <p align=\"left\">\n",
        " <img src =\"https://github.com/ylongqi/openrec-web/raw/gh-pages/openrec.tf1.png?raw=true\" width=\"100\" height=“20” />\n",
        "</p> -->\n",
        "\n",
        "\n"
      ]
    }
  ]
}